Adrian Tran, Harshitha Bachina, Hayden King, Josh Shneyder
Linear Regression
Data Visualization
Logarithmic- Visualization
Code
lineargraph =ggplot(data = joined_data, aes(x = lGDP, y = Life_Expectancy)) +geom_point(alpha =0.1, color ="orange") +labs(x ="Ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Relationship between Ln(GDP per capita) and Life Expectancy" ) +geom_smooth(method ="lm")lineargraph
Code
logmodel =lm(Life_Expectancy ~ lGDP, data = joined_data)tidy(logmodel)
\[\widehat{Life Expectancy} = 22.84 + 5.312lGDP\] The logarithmic transformation of GDP per capita makes a linear model fit quite well. The \(R^2\) is moderate, which is pretty good considering how many factors can influence life expectancy across countries and across time. This model predicts that if GDP per capita increases by 1%, mean \(LifeExpectancy\) increases by approximately 0.05312 years. At a GDP per capita of 1 dollar per capita, mean \(LifeExpectancy\) is 22.84 years.
From here on out, we will examine only the relationship between \(lGDP\) and \(Life Expectancy\).
Changes Over Time
Code
years =1959:2019lms_year =matrix(nrow =length(years), ncol =2)lms_year[, 1] = yearscoef_year =matrix(nrow =length(years), ncol =3)coef_year[, 1] = yearsfor (y in years) { year_data = joined_data |>filter(Year == y) year_model =lm(Life_Expectancy ~ lGDP, data = year_data) lms_year[y -1958, 2] =summary(year_model)$r.squared coef_year[y -1958, 2] = year_model$coefficients[1] coef_year[y -1958, 3] = year_model$coefficients[2]}lms_year =data.frame(lms_year)colnames(lms_year) =c("Year", "Rsquared")coef_year =data.frame(coef_year)colnames(coef_year) =c("Year", "Beta0", "Beta1")ggplot(data = lms_year, aes(x = Year, y = Rsquared)) +geom_line(color ="steelblue") +ylim(0, 1) +labs(y ="",subtitle ="R-Squared",title ="R-Squared of ln(GDP per capita) and Life Expectancy over time")
While variation in \(lGDP\) has consistently explained about 60% of variation in \(LifeExpectancy\), the nature of this relationship has changed significantly over time.
Code
library(patchwork)beta0 =ggplot(data = coef_year, aes(x = Year, y=Beta0)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Intercept",title ="Estimated Intercept Over Time")beta1 =ggplot(data = coef_year, aes(x = Year, y=Beta1)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Slope",title ="Estimated Slope of ln(GDP per capita) Over Time")beta0+beta1
Differences Between Countries
The relationship is pretty stable over time and has been trending upward for the past decade. Outliers can be seen in the noticeable dips in the \(R^2\). These low outliers could also be seen on the previous graphs.
ggplot(data = lms_country, aes(x = Rsquared)) +geom_histogram(binwidth =0.1, fill="steelblue") +labs(x ="R-squared",y ="",subtitle ="Count",title ="Histogram of Countries by R-Squared of ln(GDP per capita) and Life Expectancy" )
While the relationship between \(lGDP\) and \(Life Expectancy\) is quite strong for most countries, for some countries it is much weaker. This is mostly due to events that decrease life expectancy such as epidemics, famines, natural disasters and wars. Botswana for example had a linear relationship between \(lGDP\) and \(Life Expectancy\) until the AIDS crisis hit. This drop in \(Life Expectancy\) while \(lGDP\) continued to grow is responsible for its \(R^2\) of 0.0013.
Code
joined_data |>filter(country =="Botswana") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Botswana" ) +geom_smooth(method ="lm")
joined_data |>filter(country =="Germany") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Germany" ) +geom_smooth(method ="lm")
In countries without life expectancy-decreasing events since 1959, there is a very strong linear relationship. In the case Germany, the \(R^2\) is 0.983.
Even after the log transformation of \(GDP\), the data is still a bit nonlinear.
2. Random Sampling/ No Autocorrelation
The assumption of random sampling isn’t met, but we can substitute the assumption of no autocorrelation. The data shows partial autocorrelation of about 0.2 which is relatively stable across various lags, so this assumption isn’t met either. The p-value of the autocorrelation is 0.
Code
acf(joined_data$residuals, type ="correlation")
Code
library(lmtest)dwtest(logmodel)
Durbin-Watson test
data: logmodel
DW = 1.6389, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
3. No colinearity
Since there is only one explanatory variable (\(lGDP\)), there can’t be any colinearity.
4. Mean Independence
Code
ggplot(data = joined_data, aes(x = lGDP, y = residuals)) +geom_point(alpha =0.15, color ="steelblue") +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy Residual",title ="Residuals Plot" )
The assumption of mean independence isn’t met because the explanatory variable can be used to predict the residuals. For high values of \(lGDP\), the residuals are all below 0 for example.
5. Homoskedasticity
Code
library(AER)coeftest(logmodel, vcovHC(logmodel))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.84047 0.37353 61.147 < 2.2e-16 ***
lGDP 5.30161 0.04237 125.127 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
bptest(logmodel)
studentized Breusch-Pagan test
data: logmodel
BP = 301.4, df = 1, p-value < 2.2e-16
\[BP = n \times R^2_{\hat u^2} = 301.4 \]\[9162 \times R^2_{\hat u^2} = 301.4\]\[R^2_{\hat u^2} = 0.033\] While the data isn’t homoskedastic, the Breush-Pagan test statistic shows that only 3.3% of the variance in \(u_i\) can be explained by the variance in \(lGDP\). We can say this because \(var(u_i) = E(u^2_i) - E(u_i)^2\) and \(E(u_i)=0\).
While the distribution is skewed left, it isn’t severe. This is because of the low outliers mentioned earlier.
Estimated Model
As shown previously, the estimated linear regression model is: \[\widehat {Life Expectancy} = 22.84 + 5.312lGDP\] This model predicts that if GDP per capita increases by 1%, mean \(LifeExpectancy\) increases by approximately 0.05312 years. At a GDP per capita of 1 dollar per capita, mean \(LifeExpectancy\) is 22.84 years.
Model Fit
Code
variance_table =matrix(nrow =1, ncol =3)colnames(variance_table) =c("Variance of Response","Response of Fitted Values","Variance of Residuals")variance_table[1, 1] =var(joined_data$Life_Expectancy)variance_table[1, 2] =var(logmodel$fitted.values)variance_table[1, 3] =var(logmodel$residuals)variance_table[1,] =round(variance_table[1,], digits =2)kable(variance_table) |>kable_material_dark()
Variance of Response
Response of Fitted Values
Variance of Residuals
97.63
59.34
38.29
\[R^2 = 1-\frac{38.29}{97.63} = \frac{59.34}{97.63} = 0.608\] The variance in \(lGDP\) accounts for 60.7% of the variance in \(Life Expectancy\). This is a moderate \(R^2\), so this model does a decent job of explaining the response. Considering how many factors may influence \(Life Expectancy\), this is a good quality model.
Code
set.seed(1738)sigma =sigma(logmodel)predictions =predict(logmodel)simulation <-function(x, rse) { x <- x +rnorm(n =length(x), mean =0, sd = rse)return(x)}actualhist =ggplot(data = joined_data, aes(x = Life_Expectancy)) +geom_histogram(binwidth =2, fill ="steelblue")simhist =ggplot(data = joined_data, aes(x =simulation(predictions, sigma))) +geom_histogram(binwidth =2, fill ="steelblue")actualhist + simhist
Because of the left skew of the data, the simulation distribution doesn’t exactly match the actual distribution.
Code
sims =1:1000sim_data =matrix(nrow =length(sims), ncol =2)sim_data[, 1] = simsfor(num in sims) { x = predictions +rnorm(n=length(joined_data$lGDP), mean =0, sd =6.188) simmodel =lm(joined_data$Life_Expectancy ~ x) sim_data[num, 2] =summary(simmodel)$r.squared}hist(sim_data[,2])
DO THIS WITH MAP
Code
library(gganimate)library(gifski)ggplot(data=joined_data, aes(x=lGDP, y=Life_Expectancy, color=country)) +geom_point(alpha =0.7, show.legend =FALSE) +labs(title ='Year: {frame_time}', x ='ln(GDP per capita)', y ='Life Expectancy') +transition_time(Year) +ease_aes('linear')
Source Code
---title: "Final Report"author: "Adrian Tran, Harshitha Bachina, Hayden King, Josh Shneyder"format: html: self-contained: true code-tools: true toc: true code-fold: trueeditor: sourceexecute: error: true echo: true message: false warning: false---# Linear Regression```{r}#| include: falsegdp =read.csv("~/Desktop/stat-331-71-PA6/gdppercapita_us_inflation_adjusted.csv")life =read.csv("~/Desktop/stat-331-71-PA6/life_expectancy_years.csv")library(tidyverse)library(ggplot2)library(broom)library(knitr)library(kableExtra)gdp_clean <- gdp |>pivot_longer(cols ="X1959":"X2019",names_to ="Year",values_to ="GDP")life_clean <- life |>pivot_longer(cols ="X1959":"X2019",names_to ="Year",values_to ="Life_Expectancy") |>select("country", "Year", "Life_Expectancy")joined_data =merge(gdp_clean, life_clean, by =c("Year", "country")) |>mutate(Year =str_remove(Year, "X"),Year =as.numeric(Year),thousand =if_else(str_detect(GDP, "k"), T, F),GDP =as.double(str_remove(GDP, "k")),GDP =if_else(thousand, GDP *1000, GDP),lGDP =log(GDP) ) |>select(-thousand) |>drop_na()```## Data Visualization### Logarithmic- Visualization```{r}lineargraph =ggplot(data = joined_data, aes(x = lGDP, y = Life_Expectancy)) +geom_point(alpha =0.1, color ="orange") +labs(x ="Ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Relationship between Ln(GDP per capita) and Life Expectancy" ) +geom_smooth(method ="lm")lineargraphlogmodel =lm(Life_Expectancy ~ lGDP, data = joined_data)tidy(logmodel)summary(logmodel)$r.squared```$$\widehat{Life Expectancy} = 22.84 + 5.312lGDP$$The logarithmic transformation of GDP per capita makes a linear model fit quite well. The $R^2$ is moderate, which is pretty good considering how many factors can influence life expectancy across countries and across time. This model predicts that if GDP per capita increases by 1%, mean $LifeExpectancy$ increases by approximately 0.05312 years. At a GDP per capita of 1 dollar per capita, mean $LifeExpectancy$ is 22.84 years.From here on out, we will examine only the relationship between $lGDP$ and $Life Expectancy$.### Changes Over Time```{r}years =1959:2019lms_year =matrix(nrow =length(years), ncol =2)lms_year[, 1] = yearscoef_year =matrix(nrow =length(years), ncol =3)coef_year[, 1] = yearsfor (y in years) { year_data = joined_data |>filter(Year == y) year_model =lm(Life_Expectancy ~ lGDP, data = year_data) lms_year[y -1958, 2] =summary(year_model)$r.squared coef_year[y -1958, 2] = year_model$coefficients[1] coef_year[y -1958, 3] = year_model$coefficients[2]}lms_year =data.frame(lms_year)colnames(lms_year) =c("Year", "Rsquared")coef_year =data.frame(coef_year)colnames(coef_year) =c("Year", "Beta0", "Beta1")ggplot(data = lms_year, aes(x = Year, y = Rsquared)) +geom_line(color ="steelblue") +ylim(0, 1) +labs(y ="",subtitle ="R-Squared",title ="R-Squared of ln(GDP per capita) and Life Expectancy over time")```While variation in $lGDP$ has consistently explained about 60% of variation in $LifeExpectancy$, the nature of this relationship has changed significantly over time.```{r}library(patchwork)beta0 =ggplot(data = coef_year, aes(x = Year, y=Beta0)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Intercept",title ="Estimated Intercept Over Time")beta1 =ggplot(data = coef_year, aes(x = Year, y=Beta1)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Slope",title ="Estimated Slope of ln(GDP per capita) Over Time")beta0+beta1```### Differences Between CountriesThe relationship is pretty stable over time and has been trending upward for the past decade. Outliers can be seen in the noticeable dips in the $R^2$. These low outliers could also be seen on the previous graphs.```{r}joined_data <- joined_data |>mutate(residuals = logmodel$residuals)outliers <- joined_data |>filter(abs(residuals) >20) |>mutate(lGDP =round(lGDP, digits =2),residuals =round(residuals, digits =2))kable( outliers,col.names =c("Year","Country","GDP per Capita","Life Expectancy","ln(GDP per Capita)","Residual" )) |>kable_material_dark("hover")```These high residual data points are all on the low end.They represent the following events:Cameroonian War of Independence and following Civil WarGreat Leap Forward and resulting famineConflict between Hutus and Tutsis in Burundi and Rwanda (culminating in the 1993 Rwandan Genocide)The AIDS crisis in Southern Africa as seen in Botswana and EswatiniA series of hurricanes that hit Haiti in 2009```{r}countries = joined_data |>select(country) |>distinct()lms_country =matrix(nrow =length(countries), ncol =2)lms_country[, 1] = countrieslms_country =data.frame(lms_country)colnames(lms_country) =c("Country", "Rsquared")idnum =1for (c in countries$country) { country_data = joined_data |>filter(country == c) country_model =lm(Life_Expectancy ~ lGDP, data = country_data) lms_country[idnum , "Rsquared"] =summary(country_model)$r.squared idnum = idnum +1}``````{r}ggplot(data = lms_country, aes(x = Rsquared)) +geom_histogram(binwidth =0.1, fill="steelblue") +labs(x ="R-squared",y ="",subtitle ="Count",title ="Histogram of Countries by R-Squared of ln(GDP per capita) and Life Expectancy" )```While the relationship between $lGDP$ and $Life Expectancy$ is quite strong for most countries, for some countries it is much weaker. This is mostly due to events that decrease life expectancy such as epidemics, famines, natural disasters and wars. Botswana for example had a linear relationship between $lGDP$ and $Life Expectancy$ until the AIDS crisis hit. This drop in $Life Expectancy$ while $lGDP$ continued to grow is responsible for its $R^2$ of 0.0013.```{r}joined_data |>filter(country =="Botswana") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Botswana" ) +geom_smooth(method ="lm")``````{r}lms_country |>mutate(Rsquared =round(Rsquared, digits =3)) |>slice_min(order_by = Rsquared, n =8) |>kable() |>kable_material_dark("hover")``````{r}joined_data |>filter(country =="Germany") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Germany" ) +geom_smooth(method ="lm")```In countries without life expectancy-decreasing events since 1959, there is a very strong linear relationship. In the case Germany, the $R^2$ is 0.983.```{r}lms_country |>mutate(Rsquared =round(Rsquared, digits =3)) |>slice_max(order_by = Rsquared, n =8) |>kable() |>kable_material_dark("hover")```## Linear Regression### 1. Linearity```{r}lineargraph```Even after the log transformation of $GDP$, the data is still a bit nonlinear.### 2. Random Sampling/ No AutocorrelationThe assumption of random sampling isn't met, but we can substitute the assumption of no autocorrelation. The data shows partial autocorrelation of about 0.2 which is relatively stable across various lags, so this assumption isn't met either. The p-value of the autocorrelation is 0.```{r}acf(joined_data$residuals, type ="correlation")library(lmtest)dwtest(logmodel)```### 3. No colinearitySince there is only one explanatory variable ($lGDP$), there can't be any colinearity.### 4. Mean Independence```{r}ggplot(data = joined_data, aes(x = lGDP, y = residuals)) +geom_point(alpha =0.15, color ="steelblue") +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy Residual",title ="Residuals Plot" )```The assumption of mean independence isn't met because the explanatory variable can be used to predict the residuals. For high values of $lGDP$, the residuals are all below 0 for example.### 5. Homoskedasticity```{r}library(AER)coeftest(logmodel, vcovHC(logmodel))bptest(logmodel)```$$BP = n \times R^2_{\hat u^2} = 301.4 $$$$9162 \times R^2_{\hat u^2} = 301.4$$$$R^2_{\hat u^2} = 0.033$$While the data isn't homoskedastic, the Breush-Pagan test statistic shows that only 3.3% of the variance in $u_i$ can be explained by the variance in $lGDP$. We can say this because $var(u_i) = E(u^2_i) - E(u_i)^2$ and $E(u_i)=0$.### 6. Normality of $u$```{r}ggplot(data = joined_data, aes(x = residuals)) +geom_histogram(fill ="steelblue") +labs(y ="",subtitle ="Count",x ="Residuals",title ="Histogram of Residuals" )```While the distribution is skewed left, it isn't severe. This is because of the low outliers mentioned earlier.### Estimated ModelAs shown previously, the estimated linear regression model is:$$\widehat {Life Expectancy} = 22.84 + 5.312lGDP$$This model predicts that if GDP per capita increases by 1%, mean $LifeExpectancy$ increases by approximately 0.05312 years. At a GDP per capita of 1 dollar per capita, mean $LifeExpectancy$ is 22.84 years.## Model Fit```{r}variance_table =matrix(nrow =1, ncol =3)colnames(variance_table) =c("Variance of Response","Response of Fitted Values","Variance of Residuals")variance_table[1, 1] =var(joined_data$Life_Expectancy)variance_table[1, 2] =var(logmodel$fitted.values)variance_table[1, 3] =var(logmodel$residuals)variance_table[1,] =round(variance_table[1,], digits =2)kable(variance_table) |>kable_material_dark()```$$R^2 = 1-\frac{38.29}{97.63} = \frac{59.34}{97.63} = 0.608$$The variance in $lGDP$ accounts for 60.7% of the variance in $Life Expectancy$. This is a moderate $R^2$, so this model does a decent job of explaining the response. Considering how many factors may influence $Life Expectancy$, this is a good quality model.```{r}set.seed(1738)sigma =sigma(logmodel)predictions =predict(logmodel)simulation <-function(x, rse) { x <- x +rnorm(n =length(x), mean =0, sd = rse)return(x)}actualhist =ggplot(data = joined_data, aes(x = Life_Expectancy)) +geom_histogram(binwidth =2, fill ="steelblue")simhist =ggplot(data = joined_data, aes(x =simulation(predictions, sigma))) +geom_histogram(binwidth =2, fill ="steelblue")actualhist + simhist```Because of the left skew of the data, the simulation distribution doesn't exactly match the actual distribution.```{r}sims =1:1000sim_data =matrix(nrow =length(sims), ncol =2)sim_data[, 1] = simsfor(num in sims) { x = predictions +rnorm(n=length(joined_data$lGDP), mean =0, sd =6.188) simmodel =lm(joined_data$Life_Expectancy ~ x) sim_data[num, 2] =summary(simmodel)$r.squared}hist(sim_data[,2])```DO THIS WITH MAP```{r}library(gganimate)library(gifski)ggplot(data=joined_data, aes(x=lGDP, y=Life_Expectancy, color=country)) +geom_point(alpha =0.7, show.legend =FALSE) +labs(title ='Year: {frame_time}', x ='ln(GDP per capita)', y ='Life Expectancy') +transition_time(Year) +ease_aes('linear')```